Chapter 4 Results
4.1 Analysis of Player Heights
4.1.1 Overall Distribution of Male and Female Heights
We first look at the overall distribution of players’ heights for both male and female players. We expect the median height of the men to be larger than the median height of the women. For this analysis, we use only the most recent year of data we have, 2022.

## [1] "Median Male Player Height: 181 cm"
## [1] "Median Female Player Height: 170 cm"
We can clearly see that the typical male player is taller than the typical female player: the median male player height is 181 cm, while the median female player height is 170 cm. We can also see that the heights of both the male and female players are approximately normally distributed.
4.1.2 Normality of Player Heights
Let’s create a Q-Q plot for the males and females to check our hypothesis that the heights of each gender are normally distributed.

We can see that both the female and male players’ heights are approximately normally distributed. The sample heights align very well with the theoretical heights based on a normal distribution. We cannot run a Shapiro-Wilk test in this case, since we have far more than the 5,000 samples it accepts, but we can clearly see the approximate normal distribution from the Q-Q plot. This is not surprising, as heights in the general population are known to be approximately normally distributed within each sex.
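To make the Q-Q comparison concrete, the sketch below pairs each sorted sample height with the quantile that a fitted normal distribution would predict. It is a minimal illustration in Python using only the standard library; the `qq_points` helper, the plotting positions (i - 0.5) / n, and the toy heights are assumptions for illustration, not the code or data behind the figure.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    xs = sorted(sample)
    n = len(xs)
    # Normal distribution with the sample's own mean and standard deviation.
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i - 0.5) / n avoid the undefined 0 and 1 quantiles.
    return [(fitted.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

# Hypothetical heights in cm, for illustration only (not the FIFA data).
heights = [168, 172, 175, 178, 180, 181, 183, 185, 188, 193]

for theoretical, observed in qq_points(heights):
    print(f"{theoretical:6.1f}  {observed}")
```

If the sample is close to normal, the printed pairs fall near the line where the theoretical and observed values are equal.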
4.1.3 Median Player Height per Position
Let’s get to the more interesting part of the analysis. We will visualize the median heights of both female and male players per position to see whether height is more of an asset for some positions than for others. We will again use the data from 2022 only. Since most players play more than one position, we include a row in our data for each position that a player plays. Thus, some (if not most) players are included in the height calculation for multiple positions.
The Cleveland Dot Plot above provides many interesting insights. First, we can confirm the above conclusion that the male players are taller than the female players. Further, we can see that this is true for every single position.
For both the males and the females, the tallest player on the pitch tends to be the goalkeeper. The goalkeeper is in charge of protecting the net, so it makes sense for a goalkeeper to be tall in order to “cover” as much of the goal as possible. We also see that the 2nd, 3rd, and 4th tallest positions for the men are the center back (CB), striker (ST), and center defensive midfielder (CDM). These positions are also among the tallest for the women.
The tallest players (besides the goalkeeper) tend to be the defensive players (center back (CB), center defensive midfielder (CDM), left back (LB), left wing back (LWB), etc.). This agrees with what we know about football: the defensive players must be able to win headers to protect the goal, and height helps with that. Strikers are among the only offensive players to be tall, for both the men and the women. Strikers tend to play in the box, meaning they must compete against the tall defensive players for headers.
The shortest positions on the pitch tend to be the midfielders and offensive wing players (CAM, LM, RM, RW, LF, CF, etc.). Again, this makes sense. The midfielders and offensive wing players play in the middle and on the edges of the pitch. They need to be fast with the ball and be able to deliver crosses into the box. Thus, height is not as crucial as speed for these positions.
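The one-row-per-position expansion used in this section can be sketched as follows. This is a minimal Python illustration, assuming that each player record stores its positions as a comma-separated string in a column named `player_positions` (an assumed name) next to `height_cm`; the helper name and the toy records are hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_height_by_position(players):
    """Count each player once per listed position, then take the median height."""
    heights = defaultdict(list)
    for player in players:
        # A player listed as "CB, CDM" contributes to both CB and CDM.
        for position in player["player_positions"].split(","):
            heights[position.strip()].append(player["height_cm"])
    return {position: median(values) for position, values in heights.items()}

# Hypothetical records, for illustration only (not the FIFA data).
players = [
    {"player_positions": "GK", "height_cm": 190},
    {"player_positions": "CB, CDM", "height_cm": 187},
    {"player_positions": "ST, CF", "height_cm": 184},
    {"player_positions": "CAM, CF", "height_cm": 174},
    {"player_positions": "CB", "height_cm": 191},
]

for position, height in sorted(median_height_by_position(players).items()):
    print(position, height)
```

Because a player appears once per position, the per-position medians are not computed on disjoint groups of players.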
4.1.4 Distribution of Player Height per Position
Now, let’s take a look at the distribution of player heights per position. With the Cleveland Dot Plot we can see the median player height, but using a Ridgeline plot will also allow us to identify the modality of heights for each position.
We can again see that goalkeepers and defensive players tend to be the tallest. For the male players, the distribution for each position appears to be approximately normal. Interestingly, some of the female player positions show bimodal or nearly uniform height distributions. For example, there seem to be two groups of players for the RB position, one taller and one shorter. For positions like CAM, RW, ST, and CM, the distribution seems almost uniform, with heights spread over a range of 15 to 20 cm. It seems that height might not be as crucial an attribute in the women’s game as speed, for example. The women’s game is often played more on the ground than in the air, so speed and quickness could be more advantageous than height regardless of the position (except goalkeeper and some defensive positions).
4.2 Explaining Player Wages
In this part, we want to identify which features are the most important for predicting a player’s salary. The study focuses on the male players during the 2021-22 season, excluding the goalkeepers.
4.2.1 Explaining wages by a simple scatter plot
First, let’s get some insight by plotting the wages as a function of the “overall” rating (the global level of a player) and his age.
We can see from this graph that the wages grow with the overall rating of the player, which sounds natural. In contrast, age does not appear to have a strong influence on the wages. Now, let’s perform a linear regression to compare the influence of each feature.
4.2.2 Explaining wages by Linear Regression
We try to explain the wages from the data available in the other columns through a linear regression. Once the model is fitted, we keep only the 30 most significant coefficients (those with the lowest p-values) and plot them in a Cleveland dot plot to compare them.
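The selection step described above (keeping the coefficients with the lowest p-values) can be sketched as below. This is a minimal Python illustration that assumes the fitted model has already been summarized as one row per coefficient; the `most_significant` helper, the row layout, and every term name, estimate and p-value shown are hypothetical.

```python
def most_significant(coefficients, k=30):
    """Return the k coefficient rows with the smallest p-values."""
    return sorted(coefficients, key=lambda row: row["p_value"])[:k]

# Hypothetical coefficient table (term names, estimates and p-values are made up).
coefficients = [
    {"term": "overall", "estimate": 950.0, "p_value": 1e-40},
    {"term": "age", "estimate": -20.0, "p_value": 0.31},
    {"term": "club_name: Club A", "estimate": 54000.0, "p_value": 1e-12},
    {"term": "club_name: Club B", "estimate": 300.0, "p_value": 0.72},
]

for row in most_significant(coefficients, k=2):
    print(row["term"], row["estimate"], row["p_value"])
```

A categorical feature such as “club_name” contributes one coefficient per category, which is why a single feature can occupy many of the 30 slots. A low p-value indicates strong evidence of a non-zero effect, not necessarily a large one.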
4.2.2.1 Regression 1

We can see that the club name is clearly the best indicator for estimating the salary of a player. Indeed, clubs like PSG or Manchester City are well known to be extremely rich, because they benefit from the support of a country (Qatar and the United Arab Emirates, respectively). Therefore, they are able to pay their players a lot, and this seems to have a much higher influence than the actual level of the player. With this observation made, let’s remove the feature “club_name” to observe the influence of the others.
4.2.2.2 Regression 2

Now, we can observe that the features with the highest influence come from “league_name”, that is to say the league (and so, roughly, the country) the player plays in. Indeed, the English Premier League (like the other big European championships) is famous for getting huge amounts of money from broadcasting rights and therefore paying its players very high wages. With this observation made, let’s remove the feature “league_name” to observe the influence of the others.
4.2.2.3 Regression 3

Here, we can notice that the position of the player becomes the most important feature. For instance, Center Forwards (CF) are much better paid than Right Defensive Midfielders (RDM), probably because the position is more “spectacular” and televisual. We can also see that international reputation is a very important feature for explaining the salary. This can be explained by the fact that football is an open competitive market: if a player with a good reputation is not well paid, he is likely to get offers from other clubs and change clubs during the summer. With this observation made, let’s remove the feature “club_position” to observe the influence of the others.
4.3 Explaining Player Values
Now, we want to identify which features are the most important for predicting a player’s value in the transfer market.
4.3.1 Explaining values by a simple scatter plot
First, let’s get some insight by plotting the values as a function of the “overall” rating and the player’s age.
We can see from this graph that the values grow with the overall rating of the player, which sounds natural (the more gifted a player, the higher his value). But unlike in the wages study, here age seems to have a strong influence on the values! Indeed, for the same overall rating, younger players clearly have a higher value.
Now, let’s perform a linear regression to compare the influence of each feature.
4.3.2 Explaining values by Linear Regression
We try to explain the values from the data available in the other columns through a linear regression. Once the model is fitted, we keep only the 30 most significant coefficients (those with the lowest p-values) and plot them in a Cleveland dot plot to compare them.
4.3.2.1 Regression 1

As in the wages study, the club name is the best way to estimate the value of a player. Indeed, it is known that the most prestigious clubs are able to attract the most valuable players. With this observation made, let’s remove the feature “club_name” to observe the influence of the others.
4.3.2.2 Regression 2

Now, like before, we can observe that the features with the highest influence come from “league_name”, that is to say the league (and so, roughly, the country) the player plays in. It is interesting to notice that the top league is not the English Premier League but the Indian Super League, which is not a prestigious championship! One possible explanation is that this league is a pool of young talents with high value, who leave the country once they reach a certain fame. With this observation made, let’s remove the feature “league_name” to observe the influence of the others.

Here, we can notice that the position of the player becomes the most important feature. Although Center Forwards may be the highest-paid position, it is not the one with the highest value! Left Wingers (LW) are more valuable. With this observation made, let’s remove the feature “club_position” to observe the influence of the others.
Finally, we are left with only the pure skills and personality of the player. Once again, international reputation is important, but we can now observe that ratings like “overall” or “movement_balance” are also good indicators. Finally, age seems to have a strongly negative impact on value, as we explained earlier.
4.4 Clustering the players
In this part, let’s cluster the players into 3 groups, visualize them in the 2D plane spanned by the first 2 principal components of the PCA, and draw some conclusions.
We can see that the clusters are very well separated in the principal components plane (explaining 70% of the variance). By hovering over a few points, we realize that the well-known players are in the leftmost cluster (number 3). The left-to-right vector (pca_1) probably conveys the meaning of “fame” or “talent”. Let’s have a look at the centroids of the clusters to be sure.
## overall potential value_eur age height_cm weight_kg league_level
## 1 71.01614 73.90673 6040974 26.87958 179.2346 73.56735 1.252483
## 2 62.77612 69.20643 1082317 24.37759 183.1085 76.13407 1.413590
## 3 62.90256 70.06386 1107670 23.52504 178.9592 72.58113 1.425854
## club_jersey_number weak_foot skill_moves international_reputation pace
## 1 18.25760 3.141837 2.851024 1.227188 70.67287
## 2 20.34325 2.766575 2.040771 1.019835 62.00184
## 3 24.54328 3.072593 2.613936 1.009317 71.77504
## shooting passing dribbling defending physic attacking_crossing
## 1 60.40813 66.40705 69.82464 59.37058 68.82651 63.92070
## 2 36.26887 49.10560 52.83104 61.05014 66.67493 44.81855
## 3 59.19216 54.56347 63.74495 32.19449 57.82395 51.34666
## attacking_finishing attacking_heading_accuracy attacking_short_passing
## 1 57.56487 58.45531 70.41620
## 2 32.04371 58.10946 57.28007
## 3 60.30221 52.31619 59.16324
## attacking_volleys skill_dribbling skill_curve skill_fk_accuracy
## 1 53.50450 69.05897 62.11453 55.63206
## 2 31.90762 48.93462 37.94141 34.63306
## 3 52.63645 63.50776 52.04794 45.18925
## skill_long_passing skill_ball_control movement_acceleration
## 1 65.98309 70.62135 70.97408
## 2 51.36529 54.87567 61.48577
## 3 49.77155 63.25000 71.84278
## movement_sprint_speed movement_agility movement_reactions movement_balance
## 1 70.41077 72.02173 68.24069 70.88532
## 2 62.40459 57.24573 58.13095 60.05326
## 3 71.68828 69.64849 57.83773 69.00738
## power_shot_power power_jumping power_stamina power_strength power_long_shots
## 1 67.41868 67.67381 74.27250 67.25574 61.43808
## 2 46.66079 67.84224 64.81708 69.16951 34.38329
## 3 62.07104 61.41731 61.19818 59.89422 55.54484
## mentality_aggression mentality_interceptions mentality_positioning
## 1 66.08271 59.73588 64.07604
## 2 62.37741 60.37117 40.06814
## 3 47.39616 28.46914 60.52562
## mentality_vision mentality_penalties mentality_composure
## 1 65.83271 56.53352 67.59885
## 2 43.68007 40.46905 53.73866
## 3 56.17042 56.81464 57.17857
## defending_marking_awareness defending_standing_tackle
## 1 58.67489 60.60475
## 2 60.22608 63.13664
## 3 30.65450 30.51184
## defending_sliding_tackle ls st rs lw lf
## 1 57.37322 64.79469 64.79469 64.79469 67.09590 66.74441
## 2 61.10211 48.59229 48.59229 48.59229 49.34307 48.76492
## 3 28.78106 60.69332 60.69332 60.69332 61.50543 61.45264
## cf rf rw lam cam ram lm lcm
## 1 66.74441 66.74441 67.09590 67.45034 67.45034 67.45034 67.84171 67.23681
## 2 48.76492 48.76492 49.34307 49.61892 49.61892 49.61892 51.37668 52.34986
## 3 61.45264 61.45264 61.50543 60.68207 60.68207 60.68207 60.69604 55.45419
## cm rcm rm lwb ldm cdm rdm rwb
## 1 67.23681 67.23681 67.84171 65.51366 65.15720 65.15720 65.15720 65.51366
## 2 52.34986 52.34986 51.37668 58.02681 58.74490 58.74490 58.74490 58.02681
## 3 55.45419 55.45419 60.69604 48.77892 46.17139 46.17139 46.17139 48.77892
## lb lcb cb rcb rb
## 1 64.39463 62.41962 62.41962 62.41962 64.39463
## 2 59.04408 61.46428 61.46428 61.46428 59.04408
## 3 46.67023 42.25214 42.25214 42.25214 46.67023
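A centroid table like the one above is, in essence, the mean of every feature within each cluster. The sketch below shows that computation in Python on toy data; the `centroids` helper and the rows are hypothetical, and it assumes the cluster labels have already been assigned by the clustering step.

```python
from collections import defaultdict

def centroids(rows, labels):
    """Average each feature column over the rows sharing a cluster label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }

# Hypothetical (overall, height_cm) rows with already assigned cluster labels.
rows = [[70, 180], [72, 178], [60, 184], [64, 182]]
labels = [1, 1, 2, 2]

for label, center in sorted(centroids(rows, labels).items()):
    print(label, center)
```

The label numbers carry no meaning of their own: rerunning a clustering algorithm such as k-means can permute them, so clusters are best identified by their centroid values.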
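One cluster stands out with a clearly higher “overall” (about 71, against about 63 for the other two) and “value” (about 6.0 million euros, against about 1.1 million). It is therefore made up of the good and famous players, which matches the leftmost cluster (blue) in the plot. Since cluster numbers are arbitrary labels, it is identified here by its centroid rather than by its number. The two other clusters differ in height and weight: one is taller and heavier (about 183 cm and 76 kg, against about 179 cm and 73 kg), and it also has a far higher “defending” score (about 61 against 32). We can therefore guess that the bottom-to-top vector (pca_2) conveys a meaning of corpulence or, maybe, of position.
Finally, let’s have a look at the decomposition of the PCA vectors on the original features.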
## PC1 PC2
## overall -0.140523693 0.086703572
## potential -0.090435065 0.035016238
## value_eur -0.084353301 0.036262081
## age -0.061356100 0.069723915
## height_cm 0.042433553 0.077178168
## weight_kg 0.021479732 0.077799757
## league_level 0.025426437 -0.012857498
## club_jersey_number 0.018948503 -0.040080451
## weak_foot -0.058222787 -0.029819159
## skill_moves -0.116265304 -0.063005117
## international_reputation -0.071856297 0.035910279
## pace -0.070280469 -0.091149886
## shooting -0.143777955 -0.104736618
## passing -0.164980344 0.028876908
## dribbling -0.165759524 -0.045548935
## defending -0.009205539 0.230880807
## physic -0.040105643 0.155624325
## attacking_crossing -0.133899240 0.002892932
## attacking_finishing -0.128664075 -0.128067284
## attacking_heading_accuracy -0.025076435 0.101686551
## attacking_short_passing -0.148360162 0.069723984
## attacking_volleys -0.128600098 -0.098078062
## skill_dribbling -0.155804844 -0.063167006
## skill_curve -0.143659774 -0.039895559
## skill_fk_accuracy -0.126509388 -0.024262092
## skill_long_passing -0.125487893 0.096516197
## skill_ball_control -0.162693501 -0.006901399
## movement_acceleration -0.072708916 -0.095253248
## movement_sprint_speed -0.064027132 -0.082190200
## movement_agility -0.105614397 -0.084447787
## movement_reactions -0.131043993 0.084629586
## movement_balance -0.073426361 -0.073933592
## power_shot_power -0.136489683 -0.043406020
## power_jumping -0.010232627 0.079252791
## power_stamina -0.084355115 0.091153270
## power_strength -0.004567688 0.117585047
## power_long_shots -0.144640948 -0.071802689
## mentality_aggression -0.043592747 0.171457441
## mentality_interceptions -0.014079449 0.223048042
## mentality_positioning -0.143555102 -0.092398259
## mentality_vision -0.154591563 -0.027775141
## mentality_penalties -0.110843557 -0.091590110
## mentality_composure -0.137894200 0.054452687
## defending_marking_awareness -0.010257415 0.221490101
## defending_standing_tackle -0.003628388 0.223396363
## defending_sliding_tackle 0.003188535 0.220795425
## ls -0.162430468 -0.058564565
## st -0.162430468 -0.058564565
## rs -0.162430468 -0.058564565
## lw -0.169657617 -0.059577445
## lf -0.170018930 -0.059129286
## cf -0.170018930 -0.059129286
## rf -0.170018930 -0.059129286
## rw -0.169657617 -0.059577445
## lam -0.173197226 -0.039850438
## cam -0.173197226 -0.039850438
## ram -0.173197226 -0.039850438
## lm -0.173060439 -0.030320777
## lcm -0.167769487 0.060085213
## cm -0.167769487 0.060085213
## rcm -0.167769487 0.060085213
## rm -0.173060439 -0.030320777
## lwb -0.097181449 0.184855039
## ldm -0.079955298 0.208677637
## cdm -0.079955298 0.208677637
## rdm -0.079955298 0.208677637
## rwb -0.097181449 0.184855039
## lb -0.072772615 0.205754632
## lcb -0.030014201 0.231763787
## cb -0.030014201 0.231763787
## rcb -0.030014201 0.231763787
## rb -0.072772615 0.205754632
We can indeed conclude that:
- pca_1 gives large negative weights to the skill ratings (about -0.16 each for passing, dribbling and ball control) and a negative weight to value (about -0.08)
- pca_2 gives positive weights to height and weight (about 0.08 each), and its largest weights (about 0.22 to 0.23) to the defensive attributes and the center-back ratings
This is confirmed by quickly hovering over the graph. The leftmost players (Mbappe, Messi, De Bruyne, Neymar) are known to be very good and very well paid. At the top we have Ruben Dias, Laporte and Van Dijk, who are known to be very tall and powerful, while at the bottom we have Muriel, Insigne or Coman, who are short and fast.
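To see how these weights place a player in the plane, the sketch below computes partial PC1 and PC2 scores as weighted sums of standardized features. It uses four of the loadings from the table above. The two player profiles are hypothetical z-scores, and it is an assumption that the features were standardized before the PCA. Because only four features are included, the numbers indicate direction only, not the actual coordinates in the plot.

```python
# (PC1, PC2) loadings copied from the table above for four features.
LOADINGS = {
    "overall": (-0.140523693, 0.086703572),
    "dribbling": (-0.165759524, -0.045548935),
    "defending": (-0.009205539, 0.230880807),
    "height_cm": (0.042433553, 0.077178168),
}

def partial_scores(z_scores):
    """Weighted sums of standardized features, restricted to LOADINGS."""
    pc1 = sum(LOADINGS[name][0] * value for name, value in z_scores.items())
    pc2 = sum(LOADINGS[name][1] * value for name, value in z_scores.items())
    return pc1, pc2

# Hypothetical standardized profiles (z-scores), for illustration only.
skilled_attacker = {"overall": 2.0, "dribbling": 2.0, "defending": -1.0, "height_cm": -1.0}
tall_defender = {"overall": 0.5, "dribbling": -1.0, "defending": 1.5, "height_cm": 1.5}

print("attacker:", partial_scores(skilled_attacker))
print("defender:", partial_scores(tall_defender))
```

With these inputs the skilled attacker lands on the negative side of both axes (left and bottom), while the tall defender lands on the positive side of PC2 (top), in line with the reading above.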
4.5 Where are the best players from?
4.5.1 All the players
Where is football a popular sport? Europe and South America for sure, but FIFA includes players from all around the world. This map shows where they are from.
We can see that Brazil, Argentina, Spain, France and Germany are the most represented countries, with around 1000 players each. But even though football is not the national sport there, about 400 American and Chinese players are present in the game! In contrast, football is very popular in Africa, but very few African players are represented, probably because the African leagues are not yet in the game.
4.5.2 Top 1000 players
It may be interesting to see whether the countries with the most players are also the countries with the best players. For this purpose, we plot the same chart, but with only the 1000 players with the highest overall rating.
We can see that Spain is by far the best-represented country in the top 1000 in 2022, followed by Brazil, Argentina and France. It is interesting to highlight that although Germany has 4 times as many players as Italy in FIFA (1200 vs 300), both have approximately 50 players in the top 1000. The same holds for the USA vs Canada.
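The two maps rest on the same count, taken once over all players and once over the highest-rated ones. A minimal Python sketch of that count follows, assuming each record has a country column named `nationality_name` (an assumed name) and the “overall” rating; the helper and the toy records are hypothetical.

```python
from collections import Counter

def players_per_country(players, top_n=None):
    """Count players per country, optionally keeping only the top_n by overall."""
    if top_n is not None:
        players = sorted(players, key=lambda p: p["overall"], reverse=True)[:top_n]
    return Counter(p["nationality_name"] for p in players)

# Hypothetical records, for illustration only (not the FIFA data).
players = [
    {"nationality_name": "Spain", "overall": 88},
    {"nationality_name": "Spain", "overall": 70},
    {"nationality_name": "Germany", "overall": 85},
    {"nationality_name": "Germany", "overall": 65},
    {"nationality_name": "Germany", "overall": 60},
    {"nationality_name": "Italy", "overall": 86},
]

print(players_per_country(players))
print(players_per_country(players, top_n=3))
```

In this toy data Germany has the most players overall, yet each of the three countries has exactly one player among the top three, which mirrors the Germany and Italy comparison above.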
